Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. https://machinelearningmastery.com/
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a regression problem in which we try to predict the value of a continuous variable.
INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
Many thanks to K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal, for making the dataset and benchmarking information available.
In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the one that produced the best result. Iteration Take1 established a baseline performance in terms of RMSE and processing time.
In iteration Take2, we examined the feasibility of a dimensionality reduction technique that ranks attribute importance with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to a cumulative importance of 0.99 (or 99%).
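For readers who did not see Take2, the importance-ranking approach can be sketched as follows. This is an illustrative reconstruction, not the exact Take2 script: the tuning defaults and the 10-fold cross-validation setup are assumptions, and xy_train refers to the training dataframe created later in this script.

```r
# Sketch of the Take2 approach: rank attributes by gradient boosting importance
# and keep the smallest set reaching 99% cumulative importance (illustrative only)
library(caret)
set.seed(888)
gbm_fit <- train(targetVar~., data=xy_train, method="gbm",
                 metric="RMSE", trControl=trainControl(method="cv", number=10),
                 verbose=FALSE)
imp <- varImp(gbm_fit)$importance
imp <- imp[order(-imp$Overall), , drop=FALSE]
# Cumulative share of total importance, from most to least important attribute
cumImp <- cumsum(imp$Overall) / sum(imp$Overall)
keepAttributes <- rownames(imp)[cumImp <= 0.99]
```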
For this iteration, we will explore the Recursive Feature Elimination (RFE) technique, which recursively removes attributes and builds a model on those that remain. To keep the training time manageable, we will limit the number of attributes to 50.
ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data, with a best RMSE of 10299. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was slightly worse than the RMSE from the training data and possibly due to over-fitting.
From the previous iteration Take2, the baseline performance of the machine learning algorithms achieved an average RMSE of 10409. Two algorithms (ElasticNet and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with a best RMSE of 10312. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an RMSE of 13007, which was worse than the RMSE from the training data and possibly due to over-fitting.
In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 10503. Three algorithms (Ridge, LASSO, and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data, with a best RMSE of 10320. Using the optimized tuning parameters, the ElasticNet algorithm processed the validation dataset with an RMSE of 13049, which was worse than the RMSE from the training data and possibly due to over-fitting.
From the model-building activities, the number of attributes went from 58 down to 48 after eliminating 10 attributes. The processing time went from 21 hours 7 minutes in iteration Take1 down to 14 hours 49 minutes in iteration Take3, a reduction of 29% from Take1. The processing time, however, was a slight increase from Take2, which processed the dataset in 11 hours 41 minutes.
CONCLUSION: The two feature selection techniques yielded different attribute selection sets and outcomes. For this dataset, the Stochastic Gradient Boosting algorithm and the attribute importance ranking technique from iteration Take2 should be considered for further modeling or production use.
Dataset Used: Online News Popularity Dataset
Dataset ML Model: Regression with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The project aims to touch on the following areas:
Any predictive modeling machine learning project generally can be broken down into about six major tasks: define the problem, summarize the data, prepare the data, evaluate the algorithms, improve the results, and present the results.
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(parallel)
library(mailR)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
originalDataset <- read.csv("OnlineNewsPopularity.csv", header= TRUE)
# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL
# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol
colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]
if (targetCol==1) {
x_train <- xy_train[,(targetCol+1):totCol]
y_train <- xy_train[,targetCol]
y_test <- xy_test[,targetCol]
} else {
x_train <- xy_train[,1:(totAttr)]
y_train <- xy_train[,totCol]
y_test <- xy_test[,totCol]
}
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 4 by 15
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "RMSE"
email_notify <- function(msg=""){
sender <- "luozhi2488@gmail.com"
receiver <- "dave@contactdavidlowe.com"
sbj_line <- "Notification from R Script"
password <- readLines("email_credential.txt")
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"
To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
head(xy_train)
## n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2 9 255 0.6047431 1
## 3 9 211 0.5751295 1
## 5 13 1072 0.4156456 1
## 6 10 370 0.5598886 1
## 7 8 960 0.4181626 1
## 8 12 989 0.4335736 1
## n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2 0.7919463 3 1 1 0
## 3 0.6638655 3 1 1 0
## 5 0.5408895 19 19 20 0
## 6 0.6981982 2 2 0 0
## 7 0.5498339 21 20 20 0
## 8 0.5721078 20 20 20 0
## average_token_length num_keywords data_channel_is_lifestyle
## 2 4.913725 4 0
## 3 4.393365 6 0
## 5 4.682836 7 0
## 6 4.359459 9 0
## 7 4.654167 10 1
## 8 4.617796 9 0
## data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 2 0 1 0
## 3 0 1 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 2 0 0 0 0
## 3 0 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## 7 0 0 0 0
## 8 1 0 0 0
## kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## kw_avg_avg self_reference_min_shares self_reference_max_shares
## 2 0 0 0
## 3 0 918 918
## 5 0 545 16000
## 6 0 8500 8500
## 7 0 545 16000
## 8 0 545 16000
## self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 2 0.000 1 0
## 3 918.000 1 0
## 5 3151.158 1 0
## 6 8500.000 1 0
## 7 3151.158 1 0
## 8 3151.158 1 0
## weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 2 0 0 0
## 3 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01
## 2 0 0 0 0.79975569 0.05004668
## 3 0 0 0 0.21779229 0.03333446
## 5 0 0 0 0.02863281 0.02879355
## 6 0 0 0 0.02224528 0.30671758
## 7 0 0 0 0.02008167 0.11470539
## 8 0 0 0 0.02222436 0.15073297
## LDA_02 LDA_03 LDA_04 global_subjectivity
## 2 0.05009625 0.05010067 0.05000071 0.3412458
## 3 0.03335142 0.03333354 0.68218829 0.7022222
## 5 0.02857518 0.02857168 0.88542678 0.5135021
## 6 0.02223128 0.02222429 0.62658158 0.4374086
## 7 0.02002437 0.02001533 0.82517325 0.5144803
## 8 0.24343548 0.02222360 0.56138359 0.5434742
## global_sentiment_polarity global_rate_positive_words
## 2 0.14894781 0.04313725
## 3 0.32333333 0.05687204
## 5 0.28100348 0.07462687
## 6 0.07118419 0.02972973
## 7 0.26830272 0.08020833
## 8 0.29861347 0.08392315
## global_rate_negative_words rate_positive_words rate_negative_words
## 2 0.015686275 0.7333333 0.2666667
## 3 0.009478673 0.8571429 0.1428571
## 5 0.012126866 0.8602151 0.1397849
## 6 0.027027027 0.5238095 0.4761905
## 7 0.016666667 0.8279570 0.1720430
## 8 0.015166835 0.8469388 0.1530612
## avg_positive_polarity min_positive_polarity max_positive_polarity
## 2 0.2869146 0.03333333 0.7
## 3 0.4958333 0.10000000 1.0
## 5 0.4111274 0.03333333 1.0
## 6 0.3506100 0.13636364 0.6
## 7 0.4020386 0.10000000 1.0
## 8 0.4277205 0.10000000 1.0
## avg_negative_polarity min_negative_polarity max_negative_polarity
## 2 -0.1187500 -0.125 -0.1000000
## 3 -0.4666667 -0.800 -0.1333333
## 5 -0.2201923 -0.500 -0.0500000
## 6 -0.1950000 -0.400 -0.1000000
## 7 -0.2244792 -0.500 -0.0500000
## 8 -0.2427778 -0.500 -0.0500000
## title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 2 0.0000000 0.0000000 0.50000000
## 3 0.0000000 0.0000000 0.50000000
## 5 0.4545455 0.1363636 0.04545455
## 6 0.6428571 0.2142857 0.14285714
## 7 0.0000000 0.0000000 0.50000000
## 8 1.0000000 0.5000000 0.50000000
## abs_title_sentiment_polarity targetVar
## 2 0.0000000 711
## 3 0.0000000 1500
## 5 0.1363636 505
## 6 0.2142857 855
## 7 0.0000000 556
## 8 0.5000000 891
dim(xy_train)
## [1] 27752 59
dim(xy_test)
## [1] 11892 59
sapply(xy_train, class)
## n_tokens_title n_tokens_content
## "numeric" "numeric"
## n_unique_tokens n_non_stop_words
## "numeric" "numeric"
## n_non_stop_unique_tokens num_hrefs
## "numeric" "numeric"
## num_self_hrefs num_imgs
## "numeric" "numeric"
## num_videos average_token_length
## "numeric" "numeric"
## num_keywords data_channel_is_lifestyle
## "numeric" "numeric"
## data_channel_is_entertainment data_channel_is_bus
## "numeric" "numeric"
## data_channel_is_socmed data_channel_is_tech
## "numeric" "numeric"
## data_channel_is_world kw_min_min
## "numeric" "numeric"
## kw_max_min kw_avg_min
## "numeric" "numeric"
## kw_min_max kw_max_max
## "numeric" "numeric"
## kw_avg_max kw_min_avg
## "numeric" "numeric"
## kw_max_avg kw_avg_avg
## "numeric" "numeric"
## self_reference_min_shares self_reference_max_shares
## "numeric" "numeric"
## self_reference_avg_sharess weekday_is_monday
## "numeric" "numeric"
## weekday_is_tuesday weekday_is_wednesday
## "numeric" "numeric"
## weekday_is_thursday weekday_is_friday
## "numeric" "numeric"
## weekday_is_saturday weekday_is_sunday
## "numeric" "numeric"
## is_weekend LDA_00
## "numeric" "numeric"
## LDA_01 LDA_02
## "numeric" "numeric"
## LDA_03 LDA_04
## "numeric" "numeric"
## global_subjectivity global_sentiment_polarity
## "numeric" "numeric"
## global_rate_positive_words global_rate_negative_words
## "numeric" "numeric"
## rate_positive_words rate_negative_words
## "numeric" "numeric"
## avg_positive_polarity min_positive_polarity
## "numeric" "numeric"
## max_positive_polarity avg_negative_polarity
## "numeric" "numeric"
## min_negative_polarity max_negative_polarity
## "numeric" "numeric"
## title_subjectivity title_sentiment_polarity
## "numeric" "numeric"
## abs_title_subjectivity abs_title_sentiment_polarity
## "numeric" "numeric"
## targetVar
## "integer"
summary(xy_train)
## n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## Min. : 3.0 Min. : 0.0 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 9.0 1st Qu.: 246.0 1st Qu.: 0.4707 1st Qu.: 1.000
## Median :10.0 Median : 409.0 Median : 0.5393 Median : 1.000
## Mean :10.4 Mean : 547.2 Mean : 0.5555 Mean : 1.008
## 3rd Qu.:12.0 3rd Qu.: 716.0 3rd Qu.: 0.6081 3rd Qu.: 1.000
## Max. :23.0 Max. :8474.0 Max. :701.0000 Max. :1042.000
## n_non_stop_unique_tokens num_hrefs num_self_hrefs
## Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.6255 1st Qu.: 4.00 1st Qu.: 1.000
## Median : 0.6903 Median : 7.00 Median : 3.000
## Mean : 0.6957 Mean : 10.88 Mean : 3.302
## 3rd Qu.: 0.7542 3rd Qu.: 14.00 3rd Qu.: 4.000
## Max. :650.0000 Max. :304.00 Max. :74.000
## num_imgs num_videos average_token_length num_keywords
## Min. : 0.000 Min. : 0.000 Min. :0.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:4.477 1st Qu.: 6.000
## Median : 1.000 Median : 0.000 Median :4.662 Median : 7.000
## Mean : 4.563 Mean : 1.262 Mean :4.546 Mean : 7.227
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.:4.854 3rd Qu.: 9.000
## Max. :111.000 Max. :91.000 Max. :6.610 Max. :10.000
## data_channel_is_lifestyle data_channel_is_entertainment
## Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.000
## Mean :0.05387 Mean :0.178
## 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.00000 Max. :1.000
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1579 Mean :0.05801 Mean :0.1864
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## Min. :0.0000 Min. : -1.00 Min. : 0 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: -1.00 1st Qu.: 450 1st Qu.: 141.9
## Median :0.0000 Median : -1.00 Median : 662 Median : 235.1
## Mean :0.2092 Mean : 26.13 Mean : 1159 Mean : 313.8
## 3rd Qu.:0.0000 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 356.8
## Max. :1.0000 Max. :377.00 Max. :298400 Max. :42827.9
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:172048 1st Qu.: 0
## Median : 1400 Median :843300 Median :245025 Median :1034
## Mean : 13458 Mean :752066 Mean :259524 Mean :1122
## 3rd Qu.: 7900 3rd Qu.:843300 3rd Qu.:331986 3rd Qu.:2066
## Max. :843300 Max. :843300 Max. :843300 Max. :3613
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3564 1st Qu.: 2386 1st Qu.: 638
## Median : 4358 Median : 2870 Median : 1200
## Mean : 5640 Mean : 3137 Mean : 4084
## 3rd Qu.: 6021 3rd Qu.: 3605 3rd Qu.: 2600
## Max. :298400 Max. :43568 Max. :843300
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0 Min. :0.0000
## 1st Qu.: 1100 1st Qu.: 985 1st Qu.:0.0000
## Median : 2800 Median : 2200 Median :0.0000
## Mean : 10164 Mean : 6380 Mean :0.1689
## 3rd Qu.: 7900 3rd Qu.: 5100 3rd Qu.:0.0000
## Max. :843300 Max. :843300 Max. :1.0000
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1865 Mean :0.1886 Mean :0.1833
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.1434 Mean :0.06191 Mean :0.06735 Mean :0.1293
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02857
## Median :0.03339 Median :0.03334 Median :0.04000 Median :0.04000
## Mean :0.18415 Mean :0.14087 Mean :0.21465 Mean :0.22515
## 3rd Qu.:0.24039 3rd Qu.:0.15034 3rd Qu.:0.32802 3rd Qu.:0.38152
## Max. :0.92699 Max. :0.92595 Max. :0.92000 Max. :0.91998
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.00000 Min. :0.0000 Min. :-0.38021
## 1st Qu.:0.02857 1st Qu.:0.3955 1st Qu.: 0.05712
## Median :0.04073 Median :0.4534 Median : 0.11867
## Mean :0.23514 Mean :0.4430 Mean : 0.11861
## 3rd Qu.:0.40359 3rd Qu.:0.5083 3rd Qu.: 0.17700
## Max. :0.92712 Max. :1.0000 Max. : 0.65500
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02834 1st Qu.:0.009662 1st Qu.:0.6000
## Median :0.03888 Median :0.015326 Median :0.7097
## Mean :0.03955 Mean :0.016647 Mean :0.6815
## 3rd Qu.:0.05025 3rd Qu.:0.021739 3rd Qu.:0.8000
## Max. :0.15217 Max. :0.184932 Max. :1.0000
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1852 1st Qu.:0.3056 1st Qu.:0.05000
## Median :0.2800 Median :0.3583 Median :0.10000
## Mean :0.2884 Mean :0.3532 Mean :0.09536
## 3rd Qu.:0.3846 3rd Qu.:0.4108 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:0.6000 1st Qu.:-0.3282 1st Qu.:-0.7000
## Median :0.8000 Median :-0.2536 Median :-0.5000
## Mean :0.7553 Mean :-0.2596 Mean :-0.5222
## 3rd Qu.:1.0000 3rd Qu.:-0.1873 3rd Qu.:-0.3000
## Max. :1.0000 Max. : 0.0000 Max. : 0.0000
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1429 Median : 0.00000
## Mean :-0.1073 Mean :0.2819 Mean : 0.07093
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.13750
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
## abs_title_subjectivity abs_title_sentiment_polarity targetVar
## Min. :0.0000 Min. :0.0000 Min. : 4
## 1st Qu.:0.1667 1st Qu.:0.0000 1st Qu.: 946
## Median :0.5000 Median :0.0000 Median : 1400
## Mean :0.3419 Mean :0.1558 Mean : 3366
## 3rd Qu.:0.5000 3rd Qu.:0.2500 3rd Qu.: 2800
## Max. :0.5000 Max. :1.0000 Max. :690400
sapply(xy_train, function(x) sum(is.na(x)))
## n_tokens_title n_tokens_content
## 0 0
## n_unique_tokens n_non_stop_words
## 0 0
## n_non_stop_unique_tokens num_hrefs
## 0 0
## num_self_hrefs num_imgs
## 0 0
## num_videos average_token_length
## 0 0
## num_keywords data_channel_is_lifestyle
## 0 0
## data_channel_is_entertainment data_channel_is_bus
## 0 0
## data_channel_is_socmed data_channel_is_tech
## 0 0
## data_channel_is_world kw_min_min
## 0 0
## kw_max_min kw_avg_min
## 0 0
## kw_min_max kw_max_max
## 0 0
## kw_avg_max kw_min_avg
## 0 0
## kw_max_avg kw_avg_avg
## 0 0
## self_reference_min_shares self_reference_max_shares
## 0 0
## self_reference_avg_sharess weekday_is_monday
## 0 0
## weekday_is_tuesday weekday_is_wednesday
## 0 0
## weekday_is_thursday weekday_is_friday
## 0 0
## weekday_is_saturday weekday_is_sunday
## 0 0
## is_weekend LDA_00
## 0 0
## LDA_01 LDA_02
## 0 0
## LDA_03 LDA_04
## 0 0
## global_subjectivity global_sentiment_polarity
## 0 0
## global_rate_positive_words global_rate_negative_words
## 0 0
## rate_positive_words rate_negative_words
## 0 0
## avg_positive_polarity min_positive_polarity
## 0 0
## max_positive_polarity avg_negative_polarity
## 0 0
## min_negative_polarity max_negative_polarity
## 0 0
## title_subjectivity title_sentiment_polarity
## 0 0
## abs_title_subjectivity abs_title_sentiment_polarity
## 0 0
## targetVar
## 0
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(x_train[,i], main=names(x_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(x_train[,i], main=names(x_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(x_train[,i]), main=names(x_train)[i])
}
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include marking missing values, imputing missing values, and applying data transforms.
# Not applicable for this iteration of the project.
# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA
# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))
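For future reference, a typical data transform step with caret's preProcess() might look like the following sketch. It is not applied in this iteration; x_train refers to the training-predictors dataframe created earlier.

```r
# Example only: standardize the numeric predictors with caret (not used in this run)
preProcParams <- preProcess(x_train, method=c("center", "scale"))
x_train_scaled <- predict(preProcParams, x_train)
```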
# Using the Linear Regression (lm) algorithm, we perform the Recursive Feature Elimination (RFE) technique
startTimeModule <- proc.time()
set.seed(seedNum)
rfeCTRL <- rfeControl(functions=lmFuncs, method="repeatedcv", repeats=2)
rfeResults <- rfe(xy_train[,1:totAttr], xy_train[,totCol], sizes=c(30:50), rfeControl=rfeCTRL)
## Warning in predict.lm(object, x): prediction from a rank-deficient fit may
## be misleading
print(rfeResults)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 2 times)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 30 12926 0.01561 3137 8250 0.01166 266.8
## 31 13098 0.01564 3140 8697 0.01168 272.7
## 32 13393 0.01577 3144 9607 0.01168 287.7
## 33 13358 0.01596 3143 9452 0.01189 284.3
## 34 13329 0.01613 3142 9375 0.01193 283.4
## 35 12025 0.01614 3115 6823 0.01196 225.4
## 36 11939 0.01626 3113 6742 0.01208 222.3
## 37 11768 0.01639 3107 6651 0.01201 217.3
## 38 11416 0.01645 3100 5393 0.01216 202.2
## 39 10417 0.01662 3076 3596 0.01203 182.8
## 40 10465 0.01762 3074 3599 0.01313 185.4
## 41 10614 0.01786 3077 3587 0.01309 185.0
## 42 10622 0.01798 3077 3588 0.01305 184.8
## 43 10634 0.01845 3072 3592 0.01310 187.1
## 44 10314 0.02003 3050 3620 0.01200 176.6
## 45 10357 0.02484 3029 3591 0.01656 177.1
## 46 10328 0.02377 3029 3605 0.01520 177.2
## 47 10246 0.02459 3026 3659 0.01436 177.4 *
## 48 10430 0.02348 3034 3679 0.01503 184.9
## 49 10458 0.02460 3034 3685 0.01583 184.9
## 50 10408 0.02467 3033 3688 0.01550 183.8
## 58 10276 0.02648 3024 3683 0.01559 180.7
##
## The top 5 variables (out of 47):
## LDA_04, LDA_02, LDA_01, LDA_00, LDA_03
rfeAttributes <- predictors(rfeResults)
cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
## Number of attributes identified from the RFE algorithm: 47
print(rfeAttributes)
## [1] "LDA_04" "LDA_02"
## [3] "LDA_01" "LDA_00"
## [5] "LDA_03" "global_rate_positive_words"
## [7] "n_unique_tokens" "global_rate_negative_words"
## [9] "global_subjectivity" "min_positive_polarity"
## [11] "n_non_stop_unique_tokens" "rate_positive_words"
## [13] "global_sentiment_polarity" "n_non_stop_words"
## [15] "data_channel_is_entertainment" "rate_negative_words"
## [17] "avg_negative_polarity" "data_channel_is_lifestyle"
## [19] "title_sentiment_polarity" "weekday_is_saturday"
## [21] "average_token_length" "abs_title_subjectivity"
## [23] "avg_positive_polarity" "data_channel_is_socmed"
## [25] "data_channel_is_bus" "max_negative_polarity"
## [27] "max_positive_polarity" "min_negative_polarity"
## [29] "weekday_is_monday" "weekday_is_thursday"
## [31] "data_channel_is_world" "weekday_is_tuesday"
## [33] "data_channel_is_tech" "weekday_is_friday"
## [35] "abs_title_sentiment_polarity" "n_tokens_title"
## [37] "title_subjectivity" "weekday_is_wednesday"
## [39] "num_self_hrefs" "num_keywords"
## [41] "num_hrefs" "num_videos"
## [43] "num_imgs" "kw_min_min"
## [45] "kw_avg_avg" "n_tokens_content"
## [47] "kw_avg_min"
plot(rfeResults, type=c("g", "o"))
# Removing the unselected attributes from the training and validation dataframes
rfeAttributes <- c(rfeAttributes,"targetVar")
xy_train <- xy_train[, (names(xy_train) %in% rfeAttributes)]
xy_test <- xy_test[, (names(xy_test) %in% rfeAttributes)]
# Not applicable for this iteration of the project.
proc.time()-startTimeScript
## user system elapsed
## 79.008 0.745 83.113
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aaf7cc2}"
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the dataset. The typical evaluation tasks include defining the test options, spot-checking a suite of algorithms, and comparing their estimated performance.
For this project, we will evaluate four linear, three non-linear, and three ensemble algorithms:
Linear Algorithms: Linear Regression, Ridge, LASSO, and ElasticNet
Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine
Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting
The random number seed is reset before each run to ensure that each algorithm is evaluated using the same data splits, making the results directly comparable.
# Linear Regression (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lm <- train(targetVar~., data=xy_train, method="lm", metric=metricTarget, trControl=control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
print(fit.lm)
## Linear Regression
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 11086.22 0.02402052 3053.665
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
proc.time()-startTimeModule
## user system elapsed
## 3.672 0.116 3.831
email_notify(paste("Linear Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
# Ridge (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.ridge <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, trControl=control)
print(fit.ridge)
## Ridge Regression
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 11086.22 0.02402052 3053.665
## 1e-04 10920.75 0.02410280 3048.083
## 1e-01 10327.80 0.02531409 3021.739
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
## user system elapsed
## 26.246 0.690 27.235
email_notify(paste("Ridge Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23223dd8}"
# LASSO (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lasso <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, trControl=control)
print(fit.lasso)
## The lasso
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 10331.70 0.02463008 3026.996
## 0.5 10339.22 0.02441088 3024.783
## 0.9 10688.93 0.02404296 3042.317
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
proc.time()-startTimeModule
## user system elapsed
## 10.402 0.666 11.194
email_notify(paste("Lasso Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@19bb089b}"
# ElasticNet (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.en <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, trControl=control)
print(fit.en)
## Elasticnet
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 10331.90 0.02491503 3029.262
## 0e+00 0.525 10388.08 0.02413528 3029.526
## 0e+00 1.000 11086.22 0.02402052 3053.665
## 1e-04 0.050 10358.28 0.02396919 3073.611
## 1e-04 0.525 10380.94 0.02424434 3027.400
## 1e-04 1.000 10920.75 0.02410280 3048.083
## 1e-01 0.050 10396.22 0.02379999 3114.598
## 1e-01 0.525 10323.25 0.02647541 3018.860
## 1e-01 1.000 10327.80 0.02531409 3021.739
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.525 and lambda = 0.1.
proc.time()-startTimeModule
## user system elapsed
## 27.053 1.259 28.657
email_notify(paste("ElasticNet Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2ff4f00f}"
# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit.cart)
## CART
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.008022312 10442.48 0.014992953 3065.863
## 0.009380481 10410.95 0.015070346 3081.016
## 0.012215379 10405.46 0.009551093 3112.198
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01221538.
proc.time()-startTimeModule
## user system elapsed
## 17.263 0.138 17.594
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3cb5cdba}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 11615.10 0.003031659 3415.888
## 7 11252.04 0.003658338 3344.773
## 9 11012.61 0.004149603 3288.142
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
## user system elapsed
## 138.576 0.122 140.204
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1cd072a9}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 10441.92 0.02206080 2459.608
## 0.50 10434.64 0.02118205 2469.632
## 1.00 10426.44 0.01957504 2487.661
##
## Tuning parameter 'sigma' was held constant at a value of 0.01409941
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01409941 and C = 1.
proc.time()-startTimeModule
## user system elapsed
## 8714.400 13.778 8831.428
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5594a1b5}"
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 10453.83 0.01079349 3073.201
proc.time()-startTimeModule
## user system elapsed
## 114.815 0.672 116.757
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@39ba5a14}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 10332.95 0.02387294 3122.195
## 24 10568.35 0.01543550 3332.093
## 47 11019.27 0.01016815 3391.826
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 43297.04 37.91 43808.78
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@71be98f5}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 10340.06 0.02222229 3042.887
## 1 100 10341.54 0.02200986 3044.364
## 1 150 10339.66 0.02258045 3039.084
## 2 50 10418.27 0.01317292 3069.294
## 2 100 10471.26 0.01139697 3095.352
## 2 150 10519.68 0.01012141 3115.487
## 3 50 10433.49 0.01308168 3082.990
## 3 100 10497.16 0.01130203 3106.313
## 3 150 10544.77 0.01005860 3132.840
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
## user system elapsed
## 168.514 0.416 170.735
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@60f82f98}"
results <- resamples(list(LR=fit.lm, RIDGE=fit.ridge, LASSO=fit.lasso, EN=fit.en, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LR, RIDGE, LASSO, EN, CART, kNN, SVM, BagCART, RF, GBM
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 2855.464 2909.083 3012.243 3053.665 3108.379 3569.786 0
## RIDGE 2859.562 2907.244 3011.504 3021.739 3105.940 3254.649 0
## LASSO 2855.447 2908.832 3012.244 3026.996 3105.958 3306.624 0
## EN 2857.180 2902.507 3011.071 3018.860 3102.361 3267.184 0
## CART 2963.559 3010.490 3126.046 3112.198 3168.049 3355.742 0
## kNN 3113.805 3186.286 3263.199 3288.142 3356.803 3529.913 0
## SVM 2280.032 2390.282 2512.550 2487.661 2588.172 2713.174 0
## BagCART 2870.441 2960.687 3096.382 3073.201 3159.055 3347.272 0
## RF 2945.335 3040.990 3106.764 3122.195 3208.636 3375.957 0
## GBM 2859.981 2927.264 3045.145 3039.084 3121.259 3266.608 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 6468.860 7477.418 9349.197 11086.22 13456.45 23302.46 0
## RIDGE 6475.178 7472.260 9347.424 10327.80 13456.78 15735.72 0
## LASSO 6468.851 7477.381 9347.849 10331.70 13456.43 15760.19 0
## EN 6462.406 7468.497 9338.927 10323.25 13462.80 15752.18 0
## CART 6553.440 7633.558 9390.836 10405.46 13515.17 15792.26 0
## kNN 7549.125 8399.398 9916.191 11012.61 13952.92 16204.18 0
## SVM 6496.521 7628.757 9450.810 10426.44 13565.78 15796.58 0
## BagCART 6476.117 7857.534 9396.113 10453.83 13596.12 15816.49 0
## RF 6480.871 7502.368 9344.685 10332.95 13479.44 15661.77 0
## GBM 6444.068 7519.640 9333.632 10339.66 13504.61 15748.42 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## LR 3.742948e-05 0.013514075 0.026757863 0.024020521 0.034991505
## RIDGE 7.987339e-03 0.013506063 0.026533517 0.025314090 0.035623522
## LASSO 5.490973e-03 0.013515667 0.026756875 0.024630078 0.035134783
## EN 6.501294e-03 0.012847109 0.026507776 0.026475414 0.035127230
## CART 1.789651e-03 0.006260957 0.008512304 0.009551093 0.012280199
## kNN 5.790300e-04 0.001478678 0.003730796 0.004149603 0.006020437
## SVM 8.804711e-03 0.011775682 0.017012320 0.019575044 0.022351600
## BagCART 1.803747e-03 0.006538819 0.008017699 0.010793490 0.013587626
## RF 9.393580e-03 0.015251281 0.024254548 0.023872943 0.032494516
## GBM 6.540232e-03 0.010230344 0.023394023 0.022580451 0.034540579
## Max. NA's
## LR 0.043137813 0
## RIDGE 0.044167729 0
## LASSO 0.043143424 0
## EN 0.058046151 0
## CART 0.019445628 4
## kNN 0.009344345 0
## SVM 0.043237416 0
## BagCART 0.026706985 0
## RF 0.040481466 0
## GBM 0.037784149 0
dotplot(results)
cat('The average RMSE from all models is:',
mean(c(results$values$`LR~RMSE`, results$values$`RIDGE~RMSE`, results$values$`LASSO~RMSE`, results$values$`EN~RMSE`, results$values$`CART~RMSE`, results$values$`kNN~RMSE`, results$values$`SVM~RMSE`, results$values$`BagCART~RMSE`, results$values$`RF~RMSE`, results$values$`GBM~RMSE`)))
## The average RMSE from all models is: 10503.99
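The long `mean(c(...))` call above can be written more compactly by selecting every `~RMSE` column from the resamples values by name; a minimal sketch, assuming the `results` object produced by `resamples()` above:

```r
# Average RMSE across all models: pick every resample column whose
# name ends in "~RMSE" and take the grand mean of the pooled values.
rmse_cols <- grep("~RMSE$", names(results$values), value=TRUE)
mean(unlist(results$values[rmse_cols]))
```

This avoids having to list each model's column by hand and stays correct if models are added to or removed from the `resamples()` call.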
email_notify(paste("Baseline Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6108b2d7}"
After we have narrowed down to a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve the accuracy of the models.
Using the best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.
Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.
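The tuning runs that follow all use the same caret pattern: build an explicit grid of candidate parameter values with `expand.grid()` and pass it to `train()` via `tuneGrid`, which replaces caret's default `tuneLength` search. A minimal sketch of the pattern (the grid values here are illustrative, not the ones used below):

```r
# Generic caret tuning sketch: column names in the grid must match the
# method's tuning parameters (for enet: lambda and fraction).
tuning_grid <- expand.grid(lambda=c(0.001, 0.1), fraction=c(0.25, 0.75))
fit <- train(targetVar~., data=xy_train, method="enet",
             metric=metricTarget, tuneGrid=tuning_grid, trControl=control)
```

Every row of the grid is evaluated with the resampling scheme in `control`, and the row with the best `metricTarget` value is selected for the final model.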
# Tuning algorithm #1 - Ridge
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.001,0.01,0.1,1))
fit.final1 <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
## Ridge Regression
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.001 10361.45 0.02425259 3025.748
## 0.010 10535.44 0.02419578 3035.269
## 0.100 10327.80 0.02531409 3021.739
## 1.000 11228.92 0.02445028 3108.861
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
## user system elapsed
## 26.887 0.069 27.268
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6aaa5eb0}"
# Tuning algorithm #2 - LASSO
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(fraction=c(0.1,0.4,0.7,1.0))
fit.final2 <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)
print(fit.final2)
## The lasso
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 10331.70 0.02463008 3026.996
## 0.4 10328.45 0.02495078 3021.591
## 0.7 10949.14 0.02403318 3049.885
## 1.0 11086.22 0.02402052 3053.665
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.4.
proc.time()-startTimeModule
## user system elapsed
## 9.129 0.036 9.274
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5f2050f6}"
# Tuning algorithm #3 - ElasticNet
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.001,0.01,0.1,1), fraction=c(0.1,0.4,0.7,1.0))
fit.final3 <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final3)
print(fit.final3)
## Elasticnet
##
## 27752 samples
## 47 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.001 0.1 10353.86 0.02406829 3067.740
## 0.001 0.4 10320.80 0.02664446 3016.074
## 0.001 0.7 10869.15 0.02450979 3044.051
## 0.001 1.0 10361.45 0.02425259 3025.748
## 0.010 0.1 10362.87 0.02361705 3076.726
## 0.010 0.4 10321.73 0.02673645 3019.238
## 0.010 0.7 10587.71 0.02495672 3033.208
## 0.010 1.0 10535.44 0.02419578 3035.269
## 0.100 0.1 10368.39 0.02361158 3083.007
## 0.100 0.4 10322.69 0.02681851 3023.327
## 0.100 0.7 10393.32 0.02537044 3024.840
## 0.100 1.0 10327.80 0.02531409 3021.739
## 1.000 0.1 10367.25 0.02430050 3082.299
## 1.000 0.4 10332.78 0.02526869 3032.705
## 1.000 0.7 10893.06 0.02461167 3070.876
## 1.000 1.0 11228.92 0.02445028 3108.861
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.001.
proc.time()-startTimeModule
## user system elapsed
## 27.609 0.199 28.137
email_notify(paste("Algorithm #3 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@62043840}"
results <- resamples(list(RIDGE=fit.final1, LASSO=fit.final2, ENET=fit.final3))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RIDGE, LASSO, ENET
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RIDGE 2859.562 2907.244 3011.504 3021.739 3105.940 3254.649 0
## LASSO 2855.453 2909.063 3012.243 3021.591 3107.380 3250.385 0
## ENET 2852.412 2902.851 3010.354 3016.074 3099.932 3251.209 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RIDGE 6475.178 7472.260 9347.424 10327.80 13456.78 15735.72 0
## LASSO 6468.849 7477.420 9348.695 10328.45 13456.44 15725.85 0
## ENET 6455.963 7471.497 9336.962 10320.80 13462.75 15726.91 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## RIDGE 0.007987339 0.01350606 0.02653352 0.02531409 0.03562352 0.04416773
## LASSO 0.007234833 0.01351514 0.02675759 0.02495078 0.03504470 0.04314156
## ENET 0.008412791 0.01287458 0.02712340 0.02664446 0.03569260 0.05518764
## NA's
## RIDGE 0
## LASSO 0
## ENET 0
dotplot(results)
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model with the tuned parameters, and saving the model to disk for later use.
predictions <- predict(fit.final3, newdata=xy_test)
print(RMSE(predictions, y_test))
## [1] 13049.92
print(R2(predictions, y_test))
## [1] 0.01219678
startTimeModule <- proc.time()
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
set.seed(seedNum)
totCol <- ncol(xy_train)
totAttr <- totCol-1
#finalModel <- enet(xy_train[,1:totAttr], xy_train[,totCol], lambda=0.001)
#summary(finalModel)
proc.time()-startTimeModule
## user system elapsed
## 0.025 0.000 0.025
#saveRDS(finalModel, "./finalModel_Regression.rds")
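If the commented-out `saveRDS()` call above were enabled, the finalized elasticnet model could be reloaded in a later session and applied to new data. A sketch assuming the same file path; `x_new` is a hypothetical matrix of new observations with the same 47 predictor columns:

```r
# Reload the persisted elasticnet model and score new observations.
# predict.enet evaluates the coefficient path at the chosen fraction
# (s=0.4 matches the tuned value from the ElasticNet grid search).
finalModel <- readRDS("./finalModel_Regression.rds")
newPredictions <- predict(finalModel, newx=as.matrix(x_new),
                          s=0.4, mode="fraction", type="fit")$fit
```

Persisting the model this way avoids re-running the lengthy training step when the model is deployed or revisited.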
proc.time()-startTimeScript
## user system elapsed
## 52668.197 57.012 53334.342
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3f2a3a5}"